Introduction
Creating a tailored instruction dataset for fine-tuning a language model is a critical step in enhancing the model’s capabilities for specialized tasks. This guide provides a step-by-step example of how to create an instruction dataset.
Before starting, it is essential to define the dataset’s intended purpose. Are you developing a chatbot, a story generator, or a question-answering system? Clearly understanding the desired model behavior will guide the type and structure of the data you prepare.
In this example, our goal is to:
Create an instruction dataset suitable for fine-tuning a pretrained large or small language model with LoRA/QLoRA, producing a story generator designed for 5-year-olds.
Use Case
For demonstration purposes, we use:
The raw dataset TinyStories, introduced in the paper TinyStories: How Small Can Language Models Be and Still Speak Coherent English? by Ronen Eldan and Yuanzhi Li. This dataset consists of short, synthetically generated stories created by GPT-3.5 and GPT-4 with a limited vocabulary, which makes it well suited to our intended 5-year-old readers. The dataset has two splits: train (2.12M rows) and validation (22K rows). For this use case, we use only the first 10K rows of the train split.
Below is a sample view of the TinyStories dataset on the Hugging Face Dataset Hub:
To create the instruction dataset, we will generate synthetic instruction sentences that correspond to each story in the TinyStories dataset.
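As a rough sketch of the target format, each record pairs a generated instruction with the original story as the expected output. The column names instruction and output match those used later in this guide; the story text is shortened here purely for illustration.

```python
import json

# One record of the instruction dataset: the instruction is synthetic,
# the output is the original (here shortened) TinyStories text.
record = {
    "instruction": (
        "Write a story about a little girl named Lily who, with the help of "
        "her cat and dog friends, overcomes her fear of a spider to clean a "
        "cobweb in their castle."
    ),
    "output": "Once upon a time, there was a little girl named Lily. ...",
}

print(json.dumps(record, indent=2))
```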
Implementation
Step 1: Load Required Packages
Begin by loading the necessary packages:
import concurrent.futures
import json
from typing import List, Tuple

from datasets import Dataset, load_dataset
from openai import OpenAI
from tqdm.auto import tqdm
from google.colab import userdata  # Colab-only: access to stored secrets

Step 2: Define Modular Functions
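One portability note before defining the pipeline functions: the google.colab import above is only available inside a Colab runtime. Outside Colab, a small fallback could read the same keys from environment variables. The sketch below is an assumption, not part of the original pipeline, and get_secret is a hypothetical helper name.

```python
import os


def get_secret(name: str) -> str:
    """Return a secret from Colab's userdata if available, else from the environment.

    Hypothetical helper, not part of the original pipeline.
    """
    try:
        # Only importable inside a Google Colab runtime.
        from google.colab import userdata

        value = userdata.get(name)
        if value:
            return value
    except Exception:
        # Not running in Colab, or the secret is not configured there.
        pass
    value = os.environ.get(name)
    if not value:
        raise KeyError(f"Secret {name!r} not found in Colab userdata or environment")
    return value
```

With this in place, the client could be created as OpenAI(api_key=get_secret("OPENAI_API_KEY")) in either environment.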
Next, we define key functions to structure our pipeline.
Extracting Stories
The get_story_list function creates a list of stories from the raw dataset:
def get_story_list(dataset):
    # Collect the raw story text from every row of the dataset
    return [example['text'] for example in dataset]

Managing Instruction-Answer Pairs
The InstructionAnswerSet class defines a structure to store and manage instruction-answer pairs, with methods to create instances from JSON and iterate over pairs:
class InstructionAnswerSet:
    def __init__(self, pairs: List[Tuple[str, str]]):
        self.pairs = pairs

    @classmethod
    def from_json(cls, json_str: str, story: str) -> 'InstructionAnswerSet':
        # The model's JSON reply holds the instruction; the story itself is the answer
        data = json.loads(json_str)
        pairs = [(data['instruction_answer'], story)]
        return cls(pairs)

    def __iter__(self):
        return iter(self.pairs)

Generating Instruction-Answer Pairs
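The from_json method above assumes the model's reply is valid JSON containing an instruction_answer string. A truncated or oddly shaped reply would raise json.JSONDecodeError or KeyError. A minimal sketch of a more defensive parser follows; parse_instruction is a hypothetical helper, not part of the original code.

```python
import json
from typing import Optional


def parse_instruction(json_str: str) -> Optional[str]:
    """Return the instruction text, or None if the reply is unusable."""
    try:
        data = json.loads(json_str)
    except json.JSONDecodeError:
        return None  # reply was not valid JSON (e.g. cut off mid-string)
    if not isinstance(data, dict):
        return None  # valid JSON, but not an object
    instruction = data.get("instruction_answer")
    if not isinstance(instruction, str) or not instruction.strip():
        return None  # key missing, wrong type, or empty
    return instruction.strip()


# A well-formed reply and a truncated one
print(parse_instruction('{"instruction_answer": "Write a story about a brave cat."}'))
print(parse_instruction('{"instruction_answer": "Write a'))
```

Returning None instead of raising lets the caller decide whether to skip the story or retry the request.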
The generate_instruction_answer_pairs function takes a story and an OpenAI client and generates an instruction-answer pair using gpt-4o-mini. It builds a one-shot prompt that asks for a single-sentence instruction and requires the reply in JSON format:
def generate_instruction_answer_pairs(story: str, client: OpenAI) -> List[Tuple[str, str]]:
    # One-shot prompt: rules, an example story with its instruction, then the target story.
    # Doubled braces {{ }} render as literal braces inside the f-string.
    prompt = f"""Based on the following story, generate a one-sentence instruction. Instruction \
must ask to write about the content of the story.
Only use content from the story to generate the instruction. \
Instruction must never explicitly mention a story. \
Instruction must be self-contained and general.
Example story: Once upon a time, there was a little girl named Lily. \
Lily liked to pretend she was a popular princess. She lived in a big castle \
with her best friends, a cat and a dog. One day, while playing in the castle, \
Lily found a big cobweb. The cobweb was in the way of her fun game. \
She wanted to get rid of it, but she was scared of the spider that lived there. \
Lily asked her friends, the cat and the dog, to help her. They all worked together to clean the cobweb. \
The spider was sad, but it found a new home outside. Lily, the cat, and \
the dog were happy they could play without the cobweb in the way. \
And they all lived happily ever after.
Example instruction: Write a story about a little girl named Lily who, \
with the help of her cat and dog friends, overcomes her fear of a spider to \
clean a cobweb in their castle, allowing everyone to play happily ever after.
Provide your response in JSON format with the following structure:
{{"instruction_answer": "..."}}
Story:
{story}
"""
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a helpful assistant who generates an instruction "
                    "based on the given story. Provide your response in JSON format."
                ),
            },
            {"role": "user", "content": prompt},
        ],
        response_format={"type": "json_object"},  # request a JSON object reply
        max_tokens=1200,
        temperature=0.7,
    )
    # Pair the generated instruction with the original story
    result = InstructionAnswerSet.from_json(completion.choices[0].message.content, story)
    return result.pairs

Creating the Instruction Dataset
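Before fanning the generation step out over thousands of stories, it can be worth exercising its data flow offline, without spending API credits. The sketch below is an assumption-laden stand-in: FakeCompletions, fake_client, and instruction_for are hypothetical names, the prompt is reduced to the bare story, and the fake client only mimics the attribute path client.chat.completions.create(...) and the completion.choices[0].message.content shape that the generation function above relies on.

```python
import json
from types import SimpleNamespace


class FakeCompletions:
    """Stand-in for client.chat.completions; returns a canned JSON reply."""

    def create(self, **kwargs):
        # Echo the start of the user prompt so each reply is distinguishable.
        user_prompt = kwargs["messages"][-1]["content"]
        content = json.dumps({"instruction_answer": f"Write about: {user_prompt[:20]}"})
        message = SimpleNamespace(content=content)
        return SimpleNamespace(choices=[SimpleNamespace(message=message)])


# Mirrors the attribute path used on the real client.
fake_client = SimpleNamespace(chat=SimpleNamespace(completions=FakeCompletions()))


def instruction_for(story, client):
    """Simplified stand-in for generate_instruction_answer_pairs (prompt shortened)."""
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": story}],
    )
    data = json.loads(completion.choices[0].message.content)
    return [(data["instruction_answer"], story)]


pairs = instruction_for("Tom found a red ball in the park.", fake_client)
print(pairs)
```

The same fake client can be passed to the real function, since it only needs the create method and the nested response attributes.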
We now wrap the previous functions into a final function create_instruction_dataset:
def create_instruction_dataset(dataset: Dataset, client: OpenAI, num_workers: int = 4) -> Dataset:
    stories = get_story_list(dataset)
    instruction_answer_pairs = []
    # Send the API calls in parallel; each result carries its own story,
    # so completion order does not affect the instruction/story pairing.
    with concurrent.futures.ThreadPoolExecutor(max_workers=num_workers) as executor:
        futures = [executor.submit(generate_instruction_answer_pairs, story, client) for story in stories]
        for future in tqdm(concurrent.futures.as_completed(futures), total=len(futures)):
            instruction_answer_pairs.extend(future.result())
    instructions, answers = zip(*instruction_answer_pairs)
    return Dataset.from_dict({"instruction": list(instructions), "output": list(answers)})

Step 3: Orchestrating the Pipeline
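One point to settle before a full run: in create_instruction_dataset above, future.result() re-raises any exception from a worker, so a single failed API call would abort the whole job. A hedged variant of the fan-out loop that records failures and keeps going might look like the sketch below. Here fake_generate is a hypothetical stand-in for the API-backed function and collect_pairs is a hypothetical name, so the example runs without network access.

```python
import concurrent.futures
from typing import List, Tuple


def fake_generate(story: str) -> List[Tuple[str, str]]:
    """Stand-in for the API-backed generator; fails on empty input."""
    if not story:
        raise ValueError("empty story")
    return [(f"Write a story like: {story[:15]}", story)]


def collect_pairs(stories: List[str], num_workers: int = 4):
    """Run the generator in parallel, skipping stories whose call failed."""
    pairs, failures = [], []
    with concurrent.futures.ThreadPoolExecutor(max_workers=num_workers) as executor:
        # Map each future back to its story so failures can be reported or retried.
        future_to_story = {executor.submit(fake_generate, s): s for s in stories}
        for future in concurrent.futures.as_completed(future_to_story):
            try:
                pairs.extend(future.result())
            except Exception as exc:
                failures.append((future_to_story[future], repr(exc)))
    return pairs, failures


pairs, failures = collect_pairs(["A dog ran.", "", "A cat slept."])
print(len(pairs), "pairs,", len(failures), "failure(s)")
```

Results arrive in completion order rather than submission order, but each pair carries its own story, so the instruction and output columns stay aligned.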
The main function orchestrates the entire pipeline:
+ Initialize the OpenAI client
+ Load the raw TinyStories dataset
+ Create instruction dataset
+ Perform train/test split
+ Push the processed dataset to Hugging Face Hub
def main() -> None:
    # Initialize the OpenAI client
    client = OpenAI(api_key=userdata.get('OPENAI_API_KEY'))
    # Load the first 10,000 rows of the raw train split
    raw_dataset = load_dataset("roneneldan/TinyStories", split="train[:10000]")
    # Create the instruction dataset
    instruction_dataset = create_instruction_dataset(raw_dataset, client)
    # Train/test split (90% / 10%)
    filtered_dataset = instruction_dataset.train_test_split(test_size=0.1)
    # Push the processed dataset to the Hugging Face Hub
    filtered_dataset.push_to_hub("tanquangduong/TinyStories_Instruction")

Step 4: Authenticating and Running the Pipeline
Authenticate with the Hugging Face Hub and execute the pipeline:
from huggingface_hub import login

# Log in to the Hugging Face Hub
login(token=userdata.get('HF_TOKEN'))
# Launch the pipeline to create the instruction dataset
main()

Result
The resulting instruction dataset will look like this:
Conclusion
In summary, this guide demonstrated the creation of an instruction dataset tailored for fine-tuning. We first defined the purpose of fine-tuning and structured the dataset accordingly. Using gpt-4o-mini, we generated an instruction for each story with prompt-engineering practices such as precise instructions, a one-shot example, and a specified output format. Finally, the processed dataset was split into train and test sets and uploaded to the Hugging Face Hub for future use.